Descriptive Statistics of the Genome: Phylogenetic Classification of Viruses
Authors
Abstract
The typical process for classifying and submitting a newly sequenced virus to the NCBI database involves two steps. First, a BLAST search is performed to determine likely family candidates. Next, the candidate families are checked for similar species with a pairwise sequence alignment tool. The submitter's judgment is then used to determine the most likely species classification. The aim of this article is to show that this process can be automated into a fast, accurate, one-step process using the proposed alignment-free method and properly implemented machine learning techniques. We present a new family of alignment-free vectorizations of the genome, the generalized vector, that maintains the speed of existing alignment-free methods while outperforming all available methods. This new alignment-free vectorization uses the frequency of genomic words (k-mers), as is done in the composition vector, and incorporates descriptive statistics of those k-mers' positional information, as inspired by the natural vector. We analyze five different characterizations of genome similarity using k-nearest neighbor classification and evaluate them on two collections totaling over 10,000 viruses. We show that our proposed method performs better than, or as well as, other methods at every level of the phylogenetic hierarchy. The data and R code are available upon request.
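The abstract does not give the exact definition of the generalized vector, so the following Python sketch only illustrates the general idea it describes: combine k-mer counts with descriptive statistics of each k-mer's positions, then classify by nearest neighbor. The function names, the choice of mean and variance as the positional statistics, the unscaled Euclidean distance, and the toy sequences are all assumptions made for illustration, not the authors' implementation.

```python
from collections import defaultdict
from itertools import product
from statistics import mean, pvariance


def kmer_position_vector(seq, k=2, alphabet="ACGT"):
    """Hypothetical feature vector: for every possible k-mer, emit
    (count, mean start position, population variance of start positions)."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i + 1)  # 1-based start position
    vector = []
    for kmer in map("".join, product(alphabet, repeat=k)):
        pos = positions.get(kmer, [])
        if pos:
            vector.extend([len(pos), mean(pos), pvariance(pos)])
        else:
            # Assumption: an absent k-mer contributes zeros for all statistics.
            vector.extend([0, 0.0, 0.0])
    return vector


def nearest_neighbor_label(query, labeled_seqs, k=2):
    """1-nearest-neighbor classification using Euclidean distance
    between the feature vectors (an assumed similarity measure)."""
    q = kmer_position_vector(query, k)

    def dist(item):
        v = kmer_position_vector(item[0], k)
        return sum((a - b) ** 2 for a, b in zip(q, v)) ** 0.5

    return min(labeled_seqs, key=dist)[1]


# Toy reference set with made-up family labels.
references = [("AAAAAAAA", "famA"), ("CCCCCCCC", "famC")]
print(nearest_neighbor_label("AAAAAAAC", references))
```

In practice the features would likely need normalizing by genome length before distances are compared, since raw counts and positions grow with sequence size; this sketch omits that step for brevity.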
Journal title:
- Journal of computational biology : a journal of computational molecular cell biology
Volume 23, Issue 10
Pages -
Publication date: 2016